Action Recognition from Single Timestamp Supervision in Untrimmed Videos
Recognising actions in videos relies on labelled supervision during training,
typically the start and end times of each action instance. This supervision is
not only subjective, but also expensive to acquire. Weak video-level
supervision has been successfully exploited for recognition in untrimmed
videos; however, it is challenged when the number of different actions in
training videos increases. We propose a method that is supervised by single
timestamps located around each action instance, in untrimmed videos. We replace
expensive action bounds with sampling distributions initialised from these
timestamps. We then use the classifier's response to iteratively update the
sampling distributions. We demonstrate that these distributions converge to the
location and extent of discriminative action segments. We evaluate our method
on three datasets for fine-grained recognition, with an increasing number of
different actions per video, and show that single timestamps offer a reasonable
compromise between recognition performance and labelling effort, performing
comparably to full temporal supervision. Our update method improves top-1 test
accuracy by up to 5.4% across the evaluated datasets.
Comment: CVPR 201
Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video
Manual annotations of temporal bounds for object interactions (i.e. start and
end times) are typical training input to recognition, localization and
detection algorithms. For three publicly available egocentric datasets, we
uncover inconsistencies in ground truth temporal bounds within and across
annotators and datasets. We systematically assess the robustness of
state-of-the-art approaches to changes in labeled temporal bounds, for object
interaction recognition. As boundaries are trespassed, a drop of up to 10% is
observed for both Improved Dense Trajectories and Two-Stream Convolutional
Neural Network.
We demonstrate that such disagreement stems from a limited understanding of
the distinct phases of an action, and propose annotating based on the Rubicon
Boundaries, inspired by a similarly named cognitive model, for consistent
temporal bounds of object interactions. Evaluated on a public dataset, we
report a 4% increase in overall accuracy, and an increase in accuracy for 55%
of classes when Rubicon Boundaries are used for temporal annotations.
Comment: ICCV 201
Learning Action Changes by Measuring Verb-Adverb Textual Relationships
The goal of this work is to understand the way actions are performed in
videos. That is, given a video, we aim to predict an adverb indicating a
modification applied to the action (e.g. cut "finely"). We cast this problem as
a regression task. We measure textual relationships between verbs and adverbs
to generate a regression target representing the action change we aim to learn.
We test our approach on a range of datasets and achieve state-of-the-art
results on both adverb prediction and antonym classification. Furthermore, we
outperform previous work when we lift two commonly assumed conditions: the
availability of action labels during testing and the pairing of adverbs as
antonyms. Existing datasets for adverb recognition are either noisy, which
makes learning difficult, or contain actions whose appearance is not influenced
by adverbs, which makes evaluation less reliable. To address this, we collect a
new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional
recipe videos, curating a set of actions that exhibit meaningful visual
changes when performed differently. Videos in AIR are more tightly trimmed and
were manually reviewed by multiple annotators to ensure high labelling quality.
Results show that models learn better from AIR given its cleaner videos. At the
same time, adverb prediction on AIR is challenging, demonstrating that there is
considerable room for improvement.
Comment: CVPR 23. Code and dataset available at
https://github.com/dmoltisanti/air-cvpr2
BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis
Generative models for audio-conditioned dance motion synthesis map music
features to dance movements. Models are trained to associate motion patterns with
audio patterns, usually without an explicit knowledge of the human body. This
approach relies on a few assumptions: strong music-dance correlation,
controlled motion data and relatively simple poses and movements. These
characteristics are found in all existing datasets for dance motion synthesis,
and indeed recent methods can achieve good results. We introduce a new dataset
aiming to challenge these common assumptions, compiling a set of dynamic dance
sequences displaying complex human poses. We focus on breakdancing which
features acrobatic moves and tangled postures. We source our data from the Red
Bull BC One competition videos. Estimating human keypoints from these videos is
difficult due to the complexity of the dance, as well as the recording setup
with multiple moving cameras. We adopt a hybrid labelling pipeline leveraging deep
estimation models as well as manual annotations to obtain good quality keypoint
sequences at a reduced cost. Our efforts produced the BRACE dataset, which
contains over 3 hours and 30 minutes of densely annotated poses. We test
state-of-the-art methods on BRACE, showing their limitations when evaluated on
complex sequences. Our dataset can readily foster advances in dance motion
synthesis. With intricate poses and swift movements, models are forced to go
beyond learning a mapping between modalities and reason more effectively about
body structure and movements.
Comment: ECCV 2022. Dataset available at https://github.com/dmoltisanti/brac
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
First-person vision is gaining interest as it offers a unique viewpoint on
people's interaction with objects, their attention, and even intention.
However, progress in this challenging domain has been relatively slow due to
the lack of sufficiently large datasets. In this paper, we introduce
EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32
participants in their native kitchen environments. Our videos depict
nonscripted daily activities: we simply asked each participant to start
recording every time they entered their kitchen. Recording took place in 4
cities (in North America and Europe) by participants belonging to 10 different
nationalities, resulting in highly diverse cooking styles. Our dataset features
55 hours of video consisting of 11.5M frames, which we densely labeled for a
total of 39.6K action segments and 454.3K object bounding boxes. Our annotation
is unique in that we had the participants narrate their own videos (after
recording), thus reflecting true intention, and we crowd-sourced ground-truths
based on these. We describe our object, action and anticipation challenges, and
evaluate several baselines over two test splits, seen and unseen kitchens.
Dataset and Project page: http://epic-kitchens.github.io
Comment: European Conference on Computer Vision (ECCV) 2018